======================================================================================
Notes from: 9/03/05/15:00
For Final Plots and time permitting those above
## [1] "Wed Mar 6 23:19:31 2019"
Install packages & libraries.
Load data
The raw data has 113,937 rows and 81 columns. It contians information on each
loan made by the Prosper Company of San Fransico from Q4-2005 unitl Q1-2014.
## [1] 113937 81
Input data provides specific information about: loan amount, borrower rate
(or interest rate), borrower APR, current loan status, borrower income, borrower
employment status and duration, borrower credit history, and the latest payment information.
## ListingKey ListingNumber
## 17A93590655669644DB4C06: 6 Min. : 4
## 349D3587495831350F0F648: 4 1st Qu.: 400919
## 47C1359638497431975670B: 4 Median : 600554
## 8474358854651984137201C: 4 Mean : 627886
## DE8535960513435199406CE: 4 3rd Qu.: 892634
## 04C13599434217079754AEE: 3 Max. :1255725
## (Other) :113912
## ListingCreationDate CreditGrade Term
## 2013-10-02 17:20:16.550000000: 6 :84984 Min. :12.00
## 2013-08-28 20:31:41.107000000: 4 C : 5649 1st Qu.:36.00
## 2013-09-08 09:27:44.853000000: 4 D : 5153 Median :36.00
## 2013-12-06 05:43:13.830000000: 4 B : 4389 Mean :40.83
## 2013-12-06 11:44:58.283000000: 4 AA : 3509 3rd Qu.:36.00
## 2013-08-21 07:25:22.360000000: 3 HR : 3508 Max. :60.00
## (Other) :113912 (Other): 6745
## LoanStatus ClosedDate
## Current :56576 :58848
## Completed :38074 2014-03-04 00:00:00: 105
## Chargedoff :11992 2014-02-19 00:00:00: 100
## Defaulted : 5018 2014-02-11 00:00:00: 92
## Past Due (1-15 days) : 806 2012-10-30 00:00:00: 81
## Past Due (31-60 days): 363 2013-02-26 00:00:00: 78
## (Other) : 1108 (Other) :54633
## BorrowerAPR BorrowerRate LenderYield
## Min. :0.00653 Min. :0.0000 Min. :-0.0100
## 1st Qu.:0.15629 1st Qu.:0.1340 1st Qu.: 0.1242
## Median :0.20976 Median :0.1840 Median : 0.1730
## Mean :0.21883 Mean :0.1928 Mean : 0.1827
## 3rd Qu.:0.28381 3rd Qu.:0.2500 3rd Qu.: 0.2400
## Max. :0.51229 Max. :0.4975 Max. : 0.4925
## NA's :25
## EstimatedEffectiveYield EstimatedLoss EstimatedReturn
## Min. :-0.183 Min. :0.005 Min. :-0.183
## 1st Qu.: 0.116 1st Qu.:0.042 1st Qu.: 0.074
## Median : 0.162 Median :0.072 Median : 0.092
## Mean : 0.169 Mean :0.080 Mean : 0.096
## 3rd Qu.: 0.224 3rd Qu.:0.112 3rd Qu.: 0.117
## Max. : 0.320 Max. :0.366 Max. : 0.284
## NA's :29084 NA's :29084 NA's :29084
## ProsperRating..numeric. ProsperRating..Alpha. ProsperScore
## Min. :1.000 :29084 Min. : 1.00
## 1st Qu.:3.000 C :18345 1st Qu.: 4.00
## Median :4.000 B :15581 Median : 6.00
## Mean :4.072 A :14551 Mean : 5.95
## 3rd Qu.:5.000 D :14274 3rd Qu.: 8.00
## Max. :7.000 E : 9795 Max. :11.00
## NA's :29084 (Other):12307 NA's :29084
## ListingCategory..numeric. BorrowerState
## Min. : 0.000 CA :14717
## 1st Qu.: 1.000 TX : 6842
## Median : 1.000 NY : 6729
## Mean : 2.774 FL : 6720
## 3rd Qu.: 3.000 IL : 5921
## Max. :20.000 : 5515
## (Other):67493
## Occupation EmploymentStatus
## Other :28617 Employed :67322
## Professional :13628 Full-time :26355
## Computer Programmer : 4478 Self-employed: 6134
## Executive : 4311 Not available: 5347
## Teacher : 3759 Other : 3806
## Administrative Assistant: 3688 : 2255
## (Other) :55456 (Other) : 2718
## EmploymentStatusDuration IsBorrowerHomeowner CurrentlyInGroup
## Min. : 0.00 False:56459 False:101218
## 1st Qu.: 26.00 True :57478 True : 12719
## Median : 67.00
## Mean : 96.07
## 3rd Qu.:137.00
## Max. :755.00
## NA's :7625
## GroupKey DateCreditPulled
## :100596 2013-12-23 09:38:12: 6
## 783C3371218786870A73D20: 1140 2013-11-21 09:09:41: 4
## 3D4D3366260257624AB272D: 916 2013-12-06 05:43:16: 4
## 6A3B336601725506917317E: 698 2014-01-14 20:17:49: 4
## FEF83377364176536637E50: 611 2014-02-09 12:14:41: 4
## C9643379247860156A00EC0: 342 2013-09-27 22:04:54: 3
## (Other) : 9634 (Other) :113912
## CreditScoreRangeLower CreditScoreRangeUpper
## Min. : 0.0 Min. : 19.0
## 1st Qu.:660.0 1st Qu.:679.0
## Median :680.0 Median :699.0
## Mean :685.6 Mean :704.6
## 3rd Qu.:720.0 3rd Qu.:739.0
## Max. :880.0 Max. :899.0
## NA's :591 NA's :591
## FirstRecordedCreditLine CurrentCreditLines OpenCreditLines
## : 697 Min. : 0.00 Min. : 0.00
## 1993-12-01 00:00:00: 185 1st Qu.: 7.00 1st Qu.: 6.00
## 1994-11-01 00:00:00: 178 Median :10.00 Median : 9.00
## 1995-11-01 00:00:00: 168 Mean :10.32 Mean : 9.26
## 1990-04-01 00:00:00: 161 3rd Qu.:13.00 3rd Qu.:12.00
## 1995-03-01 00:00:00: 159 Max. :59.00 Max. :54.00
## (Other) :112389 NA's :7604 NA's :7604
## TotalCreditLinespast7years OpenRevolvingAccounts
## Min. : 2.00 Min. : 0.00
## 1st Qu.: 17.00 1st Qu.: 4.00
## Median : 25.00 Median : 6.00
## Mean : 26.75 Mean : 6.97
## 3rd Qu.: 35.00 3rd Qu.: 9.00
## Max. :136.00 Max. :51.00
## NA's :697
## OpenRevolvingMonthlyPayment InquiriesLast6Months TotalInquiries
## Min. : 0.0 Min. : 0.000 Min. : 0.000
## 1st Qu.: 114.0 1st Qu.: 0.000 1st Qu.: 2.000
## Median : 271.0 Median : 1.000 Median : 4.000
## Mean : 398.3 Mean : 1.435 Mean : 5.584
## 3rd Qu.: 525.0 3rd Qu.: 2.000 3rd Qu.: 7.000
## Max. :14985.0 Max. :105.000 Max. :379.000
## NA's :697 NA's :1159
## CurrentDelinquencies AmountDelinquent DelinquenciesLast7Years
## Min. : 0.0000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 0.0 1st Qu.: 0.000
## Median : 0.0000 Median : 0.0 Median : 0.000
## Mean : 0.5921 Mean : 984.5 Mean : 4.155
## 3rd Qu.: 0.0000 3rd Qu.: 0.0 3rd Qu.: 3.000
## Max. :83.0000 Max. :463881.0 Max. :99.000
## NA's :697 NA's :7622 NA's :990
## PublicRecordsLast10Years PublicRecordsLast12Months RevolvingCreditBalance
## Min. : 0.0000 Min. : 0.000 Min. : 0
## 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 3121
## Median : 0.0000 Median : 0.000 Median : 8549
## Mean : 0.3126 Mean : 0.015 Mean : 17599
## 3rd Qu.: 0.0000 3rd Qu.: 0.000 3rd Qu.: 19521
## Max. :38.0000 Max. :20.000 Max. :1435667
## NA's :697 NA's :7604 NA's :7604
## BankcardUtilization AvailableBankcardCredit TotalTrades
## Min. :0.000 Min. : 0 Min. : 0.00
## 1st Qu.:0.310 1st Qu.: 880 1st Qu.: 15.00
## Median :0.600 Median : 4100 Median : 22.00
## Mean :0.561 Mean : 11210 Mean : 23.23
## 3rd Qu.:0.840 3rd Qu.: 13180 3rd Qu.: 30.00
## Max. :5.950 Max. :646285 Max. :126.00
## NA's :7604 NA's :7544 NA's :7544
## TradesNeverDelinquent..percentage. TradesOpenedLast6Months
## Min. :0.000 Min. : 0.000
## 1st Qu.:0.820 1st Qu.: 0.000
## Median :0.940 Median : 0.000
## Mean :0.886 Mean : 0.802
## 3rd Qu.:1.000 3rd Qu.: 1.000
## Max. :1.000 Max. :20.000
## NA's :7544 NA's :7544
## DebtToIncomeRatio IncomeRange IncomeVerifiable
## Min. : 0.000 $25,000-49,999:32192 False: 8669
## 1st Qu.: 0.140 $50,000-74,999:31050 True :105268
## Median : 0.220 $100,000+ :17337
## Mean : 0.276 $75,000-99,999:16916
## 3rd Qu.: 0.320 Not displayed : 7741
## Max. :10.010 $1-24,999 : 7274
## NA's :8554 (Other) : 1427
## StatedMonthlyIncome LoanKey TotalProsperLoans
## Min. : 0 CB1B37030986463208432A1: 6 Min. :0.00
## 1st Qu.: 3200 2DEE3698211017519D7333F: 4 1st Qu.:1.00
## Median : 4667 9F4B37043517554537C364C: 4 Median :1.00
## Mean : 5608 D895370150591392337ED6D: 4 Mean :1.42
## 3rd Qu.: 6825 E6FB37073953690388BC56D: 4 3rd Qu.:2.00
## Max. :1750003 0D8F37036734373301ED419: 3 Max. :8.00
## (Other) :113912 NA's :91852
## TotalProsperPaymentsBilled OnTimeProsperPayments
## Min. : 0.00 Min. : 0.00
## 1st Qu.: 9.00 1st Qu.: 9.00
## Median : 16.00 Median : 15.00
## Mean : 22.93 Mean : 22.27
## 3rd Qu.: 33.00 3rd Qu.: 32.00
## Max. :141.00 Max. :141.00
## NA's :91852 NA's :91852
## ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate
## Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 0.00 Median : 0.00
## Mean : 0.61 Mean : 0.05
## 3rd Qu.: 0.00 3rd Qu.: 0.00
## Max. :42.00 Max. :21.00
## NA's :91852 NA's :91852
## ProsperPrincipalBorrowed ProsperPrincipalOutstanding
## Min. : 0 Min. : 0
## 1st Qu.: 3500 1st Qu.: 0
## Median : 6000 Median : 1627
## Mean : 8472 Mean : 2930
## 3rd Qu.:11000 3rd Qu.: 4127
## Max. :72499 Max. :23451
## NA's :91852 NA's :91852
## ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent
## Min. :-209.00 Min. : 0.0
## 1st Qu.: -35.00 1st Qu.: 0.0
## Median : -3.00 Median : 0.0
## Mean : -3.22 Mean : 152.8
## 3rd Qu.: 25.00 3rd Qu.: 0.0
## Max. : 286.00 Max. :2704.0
## NA's :95009
## LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination LoanNumber
## Min. : 0.00 Min. : 0.0 Min. : 1
## 1st Qu.: 9.00 1st Qu.: 6.0 1st Qu.: 37332
## Median :14.00 Median : 21.0 Median : 68599
## Mean :16.27 Mean : 31.9 Mean : 69444
## 3rd Qu.:22.00 3rd Qu.: 65.0 3rd Qu.:101901
## Max. :44.00 Max. :100.0 Max. :136486
## NA's :96985
## LoanOriginalAmount LoanOriginationDate LoanOriginationQuarter
## Min. : 1000 2014-01-22 00:00:00: 491 Q4 2013:14450
## 1st Qu.: 4000 2013-11-13 00:00:00: 490 Q1 2014:12172
## Median : 6500 2014-02-19 00:00:00: 439 Q3 2013: 9180
## Mean : 8337 2013-10-16 00:00:00: 434 Q2 2013: 7099
## 3rd Qu.:12000 2014-01-28 00:00:00: 339 Q3 2012: 5632
## Max. :35000 2013-09-24 00:00:00: 316 Q2 2012: 5061
## (Other) :111428 (Other):60343
## MemberKey MonthlyLoanPayment LP_CustomerPayments
## 63CA34120866140639431C9: 9 Min. : 0.0 Min. : -2.35
## 16083364744933457E57FB9: 8 1st Qu.: 131.6 1st Qu.: 1005.76
## 3A2F3380477699707C81385: 8 Median : 217.7 Median : 2583.83
## 4D9C3403302047712AD0CDD: 8 Mean : 272.5 Mean : 4183.08
## 739C338135235294782AE75: 8 3rd Qu.: 371.6 3rd Qu.: 5548.40
## 7E1733653050264822FAA3D: 8 Max. :2251.5 Max. :40702.39
## (Other) :113888
## LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees
## Min. : 0.0 Min. : -2.35 Min. :-664.87
## 1st Qu.: 500.9 1st Qu.: 274.87 1st Qu.: -73.18
## Median : 1587.5 Median : 700.84 Median : -34.44
## Mean : 3105.5 Mean : 1077.54 Mean : -54.73
## 3rd Qu.: 4000.0 3rd Qu.: 1458.54 3rd Qu.: -13.92
## Max. :35000.0 Max. :15617.03 Max. : 32.06
##
## LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss
## Min. :-9274.75 Min. : -94.2 Min. : -954.5
## 1st Qu.: 0.00 1st Qu.: 0.0 1st Qu.: 0.0
## Median : 0.00 Median : 0.0 Median : 0.0
## Mean : -14.24 Mean : 700.4 Mean : 681.4
## 3rd Qu.: 0.00 3rd Qu.: 0.0 3rd Qu.: 0.0
## Max. : 0.00 Max. :25000.0 Max. :25000.0
##
## LP_NonPrincipalRecoverypayments PercentFunded Recommendations
## Min. : 0.00 Min. :0.7000 Min. : 0.00000
## 1st Qu.: 0.00 1st Qu.:1.0000 1st Qu.: 0.00000
## Median : 0.00 Median :1.0000 Median : 0.00000
## Mean : 25.14 Mean :0.9986 Mean : 0.04803
## 3rd Qu.: 0.00 3rd Qu.:1.0000 3rd Qu.: 0.00000
## Max. :21117.90 Max. :1.0125 Max. :39.00000
##
## InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
## Min. : 0.00000 Min. : 0.00 Min. : 1.00
## 1st Qu.: 0.00000 1st Qu.: 0.00 1st Qu.: 2.00
## Median : 0.00000 Median : 0.00 Median : 44.00
## Mean : 0.02346 Mean : 16.55 Mean : 80.48
## 3rd Qu.: 0.00000 3rd Qu.: 0.00 3rd Qu.: 115.00
## Max. :33.00000 Max. :25000.00 Max. :1189.00
##
Based on goals of this analysis 54 original columns are retained and the rest
were dropped. Where possible, column names will be shortened.
## [1] "ListingCreationDate" "CreditGrade"
## [3] "Term" "LoanStatus"
## [5] "ClosedDate" "BorrowerAPR"
## [7] "BorrowerRate" "RatingNumeric"
## [9] "RatingAlpha" "ProsperScore"
## [11] "LoanCategory" "BorrowerState"
## [13] "Occupation" "EmploymentStatus"
## [15] "EmploymentDuration" "HomeMortgage"
## [17] "DateCreditPulled" "CreditScoreLower"
## [19] "CreditScoreUpper" "FirstCreditLineDate"
## [21] "CurrentCreditLines" "OpenCreditLines"
## [23] "CreditLinesPast7yrs" "RevolvingAccounts"
## [25] "RevolvingMonthlyPayment" "InquiriesLast6Months"
## [27] "TotalInquiries" "CurrentDelinquencies"
## [29] "AmountDelinquent" "OverdueLast7Years"
## [31] "RevolvingCreditBalance" "BankcardUtilization"
## [33] "AvailableBankcardCredit" "DebtToIncomeRatio"
## [35] "IncomeRange" "IncomeVerifiable"
## [37] "StatedMonthlyIncome" "TotalProsperLoans"
## [39] "OnTimeProsperPayments" "LatePayUnder30days"
## [41] "LatePayOver30days" "LoanDaysDelinquent"
## [43] "InitialDefaultCycle" "LoanDurationMonths"
## [45] "LoanNumber" "LoanOriginalAmount"
## [47] "LoanOriginationDate" "LoanOriginationQuarter"
## [49] "MonthlyLoanPayment" "PercentFunded"
## [51] "Recommendations" "InvestorsFriendsCount"
## [53] "FriendsAmountInvested" "Investors"
The table below is a six statistic summary of the ‘BorrrowerAPR’ column.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00653 0.15629 0.20976 0.21883 0.28381 0.51229 25
The following four barplots of the ‘BorrowerAPR’ column show frequency on the
y axis and Borrower’s APR on the x axis.
The first plot has a binwidth of 0.05. In the next three plots the binwidth was
gradually decreased. Smaller binwidths show the extent to which Prosper uses
fine increments in interest rates.
A high frequency of loans are made at a rate of roughly 0.365 percent however
this fact is not visibly apparent when the chosen binwidth is 0.05.
We can now see that there is a hidden spike in loan frequencies at the roughly
the 0.365 rate using a binwidth of 0.01 rather than 0.05.
By decreasing the binwidth even further we notice the incremental difference
interest rates being charged by the Prosper company doesn’t seem feasible.
My conclusion is that these interest rates are randomly generated by the
individual or individuals providing the data and that these are not true
interest rates being charged by banks because there are 6677 different rates
mostly with a difference of 0.00001 which is extremely unlikely. Had I not
explored using different binwidths I probably would not have discovered this.
The six statistic summary below of ‘BorrowersRate’ (interest rate) shows that
most interest rate are between 0.134 and 0.498 percent.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1340 0.1840 0.1928 0.2500 0.4975
The plot of ‘BorrowerRate’ below using a binwidth of 0.001 increases the
granulartiy visible in the dispersion of interest rate values.
The next plot below displays the ‘BorrowerRate’ distribution as a density
for comparison to the bar plot above. This density curve shows the relative
percentage of total on the y axis rather than count as in the plot above.
The concentration of loans around the rate of 0.14 and the spike in frequency
at the 3.25 rate level are noticable.
The six statistic summary of ‘LoanTerm’ below indicates the median loan term
is 36 months a minimum of 12 months and a maximum of 60 months.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.00 36.00 36.00 40.83 36.00 60.00
The histogram of ‘LoanTerm’ below identifies three loan term lengths
as either (12, 36, or 60) months, the most frequent being 36 months visually
repeating the results of the summary statistics above.
The table of the ‘BorrowerState’ below shows counts of loans made by each state.
##
## AK AL AR AZ CA CO CT DC DE FL GA
## 5515 200 1679 855 1901 14717 2210 1627 382 300 6720 5008
## HI IA ID IL IN KS KY LA MA MD ME MI
## 409 186 599 5921 2078 1062 983 954 2242 2821 101 3593
## MN MO MS MT NC ND NE NH NJ NM NV NY
## 2318 2615 787 330 3084 52 674 551 3097 472 1090 6729
## OH OK OR PA RI SC SD TN TX UT VA VT
## 4197 971 1817 2972 435 1122 189 1737 6842 877 3278 207
## WA WI WV WY
## 3048 1842 391 150
The following barplot, shows the count of loans by state on the y axis and
the x axis lists each state. The first bar in the lower left of the graph shows
over 5 thousand borrowers had not indicated which state they were located in.
not provide the state information to the lender.
The following table of EmploymentStatus’ gives counts for each employment group.
## Employed Full-time Not available Not employed
## 2255 67322 26355 5347 835
## Other Part-time Retired Self-employed
## 3806 1088 795 6134
The following barchart helps visualize the counts of the different employment
groups Prosper is making loans to.
Columns 1,4 & 6 contain 11,408 loans without a without a clearly defined
employment type. Perhaps these loans weren’t made to individuals but to some
kind of business or organization.
The table below of ‘HomeMortgage’, tells us the number of borrowers who have a
home mortgage and the number of those who don’t have one. Its clear from these
counts that Prosper makes loans to both those with and without a home mortgage
in about the same frequency.
## False True
## 56459 57478
The following table on ‘IncomeRange’ shows the counts borrowers in each grouping.
Two of the values don’t represent income ranges which account for 8,547 loans.
The order of the columns are not in an sensible order like increasing from left
to right.
## $0 $1-24,999 $100,000+ $25,000-49,999 $50,000-74,999
## 621 7274 17337 32192 31050
## $75,000-99,999 Not displayed Not employed
## 16916 7741 806
In the following barplot of ‘IncomeRange’ plots the same information in the
above table. This will be changed in the plot following the one below.
After converting the ‘IncomeRange’ variable into an ordered factor we re-plot it
to create a more accurate visualization of the actual income distribution for the
rows containing valid income ranges.
The table below divides ‘IncomeVerifiable’ into the counts of yes and no values. no.
## False True
## 8669 105268
The barplot ‘IncomeVerifiable’ below, shows the same difference in yes and no
counts visually. A visual comparison seems to provide a more meaningful message.
The following table of ‘LoanOrigainationQuarter’, is an un-ordered count of
loans made by Prosper company druing each quarter.
##
## Q1 Q2 Q3 Q4
## 29678 24906 27967 31386
The following plots shows the count of loans made by Prosper each quarter.
The table below provides the number of loans originated by year.
##
## 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
## 22 5906 11460 11552 2047 5652 11228 19553 34345 12172
The following barplot of “OriginationYear” display counts of loans originated
yearly by the Prosper.
Added a new column called ‘LoanCode’ to the data frame uni_data that correctly
renames the the loan category values that are currently integers with a human
readable value using a string.
After producing a summary table of the ‘LoanCategory’ I changed the data type
of ‘LoanCategory’ from integer to factor creating a different table showing the
counts for each loan category.
## Not Available Debt Consolidation Home Improvement
## 16965 58308 7433
## Business Personal Loan Student Use
## 7189 2395 756
## Auto Other Baby Adoption
## 2572 10494 199
## Boat Cosmetic Procedure Engagement Ring
## 85 91 217
## Green Loans Household Expenses Large Purchases
## 59 1996 876
## Medical / Dental Motorcycle RV
## 1522 304 52
## Taxes Vacation Wedding Loans
## 885 768 771
The bar plot below shows the ‘LoanCode’ counts of 20 categories of loans
the Prosper company makes, consolidation loans being the most frequent.
The table below summarizes the counts of the eight credit grades.
## A AA B C D E HR NC
## 84984 3315 3509 4389 5649 5153 3289 3508 141
The bar plot below displays the eight ‘CreditGrades’ distribution that includes
a group of 80K unclassified loan grades. These credit grades only apply to loans
prior to the year 2009 and therefore this variable is not consistant across the
data set and should be used with caution in any calculations.
## Cancelled Chargedoff Completed
## 5 11992 38074
## Current Defaulted FinalPaymentInProgress
## 56576 5018 205
## Past Due (>120 days) Past Due (1-15 days) Past Due (16-30 days)
## 16 806 265
## Past Due (31-60 days) Past Due (61-90 days) Past Due (91-120 days)
## 363 313 304
The barchart below shows the loan fruquencies of each ‘LoanStatus’ category.
The table below summarizes the number of CurrentDelinquencies.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.0000 0.0000 0.5921 0.0000 83.0000 697
The histogram below shows the frequency of the ‘LoanDaysDelinquent’ variable.
The following cell displays the results of the summmary function producing a six
statistic table of the CreditScoreRangeLower and CreditScoreRangeUpper.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 660.0 680.0 685.6 720.0 880.0 591
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 19.0 679.0 699.0 704.6 739.0 899.0 591
Added ‘CreditScoreMean’ column by calculating the mean of the difference between
‘CreditScoreRangeLower’ and ‘CreditScoreRangeUpper’. The followign plots shows
the mean of the added column CreditScoreMean in a dark red dashed verticle line.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 10.0 670.0 690.0 695.6 730.0 890.0 591
## Mean
## 695.5677
## [1] "Tue Mar 5 13:42:52 2019"
The initial raw data set contained 113,937 (observations / rows), with 81
(variables / columns), on each loan, such as: loan amount, borrower rate
(or interest rate), current loan status, borrower income, borrower employment
status, borrower credit history, and the latest payment information, to name
a few. In the early stage of this analysis twenty seven rows were dropped
bringing the number of columns at this stage of the analysis to 54.
## [1] 113937 59
The cell below shows the structure of the cleaned data frame that was used in
the univatiate analysis.
## 'data.frame': 113937 obs. of 57 variables:
## $ ListingCreationDate : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
## $ CreditGrade : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ ClosedDate : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ RatingNumeric : int NA 6 NA 6 3 5 2 4 7 7 ...
## $ RatingAlpha : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ LoanCategory : int 0 2 0 16 2 1 1 2 7 7 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ EmploymentDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ HomeMortgage : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ DateCreditPulled : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
## $ CreditScoreLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ FirstCreditLineDate : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
## $ OpenCreditLines : int 4 14 NA 5 19 17 7 6 16 16 ...
## $ CreditLinesPast7yrs : int 12 29 3 29 49 49 20 10 32 32 ...
## $ RevolvingAccounts : int 1 13 0 7 6 13 6 5 12 12 ...
## $ RevolvingMonthlyPayment: num 24 389 0 115 220 1410 214 101 219 219 ...
## $ InquiriesLast6Months : int 3 3 0 0 1 0 0 3 1 1 ...
## $ TotalInquiries : num 3 5 1 1 9 2 0 16 6 6 ...
## $ CurrentDelinquencies : int 2 0 1 4 0 0 0 0 0 0 ...
## $ AmountDelinquent : num 472 0 NA 10056 0 ...
## $ OverdueLast7Years : int 4 0 0 14 0 0 0 0 0 0 ...
## $ RevolvingCreditBalance : num 0 3989 NA 1444 6193 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## $ AvailableBankcardCredit: num 1500 10266 NA 30754 695 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ IncomeVerifiable : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ TotalProsperLoans : int NA NA NA NA 1 NA NA NA NA NA ...
## $ OnTimeProsperPayments : int NA NA NA NA 11 NA NA NA NA NA ...
## $ LatePayUnder30days : int NA NA NA NA 0 NA NA NA NA NA ...
## $ LatePayOver30days : int NA NA NA NA 0 NA NA NA NA NA ...
## $ LoanDaysDelinquent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InitialDefaultCycle : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanDurationMonths : int 78 0 86 16 6 3 11 10 3 3 ...
## $ LoanNumber : int 19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
## $ LoanOriginationQuarter : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
## $ PercentFunded : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Recommendations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestorsFriendsCount : int 0 0 0 0 0 0 0 0 0 0 ...
## $ FriendsAmountInvested : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
## $ IncomeRange_ordered : Ord.factor w/ 8 levels "Not employed"<..: 5 6 2 5 8 8 5 5 5 5 ...
## $ LoanCode : Factor w/ 21 levels "Not Available",..: 1 3 1 17 3 2 2 3 8 8 ...
## $ CreditScoreMean : num 650 690 490 810 690 750 690 710 830 830 ...
Features relating to the five C’s of Credit analysis: (capacity, capital,
conditions, character & collateral). Because most of the loans Prosper makes
are consolidation loans and not home loans varialbes representing colateral are
not present in this data.
Supporting features of interest
Columns 26 & 27 (used to create the CreditScoreMean column ‘55’ in the
uni_data so that the standard deviation of the CreditScoreMean could be
calculated to enalbe analysis of the spread in CreditScoreMeans as they
relate to the 5 C’s.
Added the ‘LoanCode’ column ’ changing the Loan Cateories from numerics
vlues to string values in order to generate meaningful axis lables on the
plots. This brought the total column count up to 56.
Added ‘OriginationQuarter’ & ‘OriginationYear; columns by extracting two strings from the ’LoanOriginationQarter’ column bringing the column count up to
58 columns.
Added ‘IncomeRage_ordered’ to rearrange the vlaues in the ‘IncomeRange’
column in ascending order for those columns that had monetary values bringing
the column count up to 59.
The BorrowerAPR was very unusual in that, although there were over 113,000
different loans contained in this data set, there were 6677 different interest
rates each separated by a difference of just 0.00001 percent. It would be un-
realistic for a bank to be able to apply this many different rates to its pool
of borrowers. With out performing this exploratory analysis I wouldn’t have
stumbled upon this finding.
Columns Modified
Used the lubridate package year function to capature the date field that
enabled plotting of the data in the “LoanOringinationDate” column.
This transformation was done to enable plotting of time data.
Ordered factor variables in the ‘IncomeRange’ column improving plot readability.
Rotated the x_tic labels in three plots to improve the display of those labels’
EmplymentStatus
IncomeRange
LoanCode (replaced the LoanCategory column for plotting this data)
Columns dropped:
columns 1,2 (listing numbers aren“t” any of the 5 C’s of loan risk analysis (Capital,
Capacity, Conditions, Character or Colateral)
columns 23:24 (insufficient information is available regarding these values)
column 46 (not sure what kinds of Trades these are and how they relate to the 5 C’s)
column 52:58 (each column had roughly 90K NAs or missing data so only 10% of the
observations provide information on this feature)
columns 59 & 61 (each columns has over 95K NAs, same as above)
columns 69:72 (columns relate to costs and profitability of Prosper’s business
operations without giving information relevant to the 5 C’s.
Rename (copy) uni_data to bi_data for further bivariate analysis
The data frame above was renamed to bi_data and will be used for the bivariate
section of analysis.
The following eleven plots use the ggpairs function to detect if a correlation
exists between the two variables in the plots.
## [1] "Mon Mar 4 00:30:04 2019"
## [1] "Mon Mar 4 00:30:13 2019"
## [1] "Mon Mar 4 00:30:23 2019"
## [1] "Mon Mar 4 00:30:27 2019"
## [1] "Mon Mar 4 00:30:31 2019"
## [1] "Mon Mar 4 00:30:35 2019"
## [1] "Mon Mar 4 00:30:45 2019"
## [1] "Mon Mar 4 00:30:54 2019"
## [1] "Mon Mar 4 00:31:03 2019"
## [1] "Mon Mar 4 00:31:13 2019"
## [1] "Mon Mar 4 00:31:21 2019"
Bivariate Plot 1 analysis
BorrowerRate vs. CurrentDelinquencies were the two variables in the first
bivariate plot. The results show a correlation of 0.177 indicating a weak
correlation.
Bivariate Plot 2 analysis
LoanDurationMonths vs. CurrentDelinquencies were the two variables plotted in
the second bivariate plot producing a weak correlation coefficeint of 0.248.
Bivariate Plot 3 analysis
EmploymentStatus vs. CurrentDelinquecies were used in the third bivariate plot.
In this plot a correlation calculation was not returned with the plot because
the inputs are non numeric.
Bivariate Plot 4 analysis
IncomeVerfiable vs. CurrentDelinquencies were plotted in the fourth bivariate
plot. From this plot we can just see that IncomeVerifiable shows less variance. This is probably mainly becuase lenders are less likely to lend to borrowers
without a verifiable income. Therefore the lower numbers of loans generate a
lower variance.
Bivariate Plot 5 analysis
IncomeRange_ordered vs. CurrentDelinquencies were plotted in the 5th bivarate
plot. This plot visibly shows that the higher the income range the lower the
CurrentDelinquencies rate.
Bivariate Plot 6 analysis
OverdueLast7Year vs CurrentDelinquencies were the two variables plotted in the
sixth bivariate plot. A correlation coefficient was calculated since both of
these varaibles are numeric which turns out to be .378 indicating there is
significance in the relationship between these two variables. So would be
correct to say that the ‘OverdueLast7Years’ variable is related to the level of
‘CurrentDelinquencies’.
Bivariate Plot 7 analysis
CreditScoreMean vs. CurrentDelinquencies were the two varaibles plotted in the
seventh bivaraite plot. It turns out that, there is a significant negative
correlation between these two varaibles which was calculated to be -0.368.
In other words as the ‘CreditScoreMean’ increases, the ‘CurrentDelinquencies’
rate decreases. This explains why Credit Score is such a key component of loan
risk analysis when being used by lenders to evaluate whether or not to grant a
loan request.
Bivariate Plot 8 analysis
‘FriendsAmountInvested’ vs. ‘CurrentDelinquencies’ were the two varaibles plot-
ted in this bivaraite plot. The returned correlation cooefficient of 0.0153 is
not considered significant for this analysis.
Bivariate Plot 9 analysis
‘TotalInquiries’ vs. ‘Investors’ were the two variables used in this plot.
A correlation coefficient of 0.0263 was returned meaning little correlation is
present among these two varaibles.
Bivariate Plot 10 analysis
‘FreindsAmountInvested vs. ’DebtToIncomeRatio’ were the two varialbes used in
this bivariate plot. The result was a correlation coefficient of only 0.0279,
not an indication of a significant correlation between these two variables.
Bivariate Plot 11 analysis
‘CreditScoreLower’ vs. ‘LoanOriginationAmount’ were the two variables used in
this bivaraiate plot. A correlation coefficient of 0.341 was returned, meaning
the relationship between these two variables is at the lower end of the signifi-
cant level.
Bivariate Plot Analysis Summary
The three variables with the highest correlation coefficients were the ‘Overdue
Last7Years’ at 0.378, then ‘CreditScoreMean at -0.068, and finally ’CreditScore
Lower’ at 0.341. The variable with the next highest correlation was ‘Freinds
AmountInvested’. Analyzing these results we can conclude that a persons past
credit performance for the most part along with some influence from friends
support might provide many of the characteristics that lend to predicting future
loan outcomes.
The bivariate plot of ‘CreditScoreLower’ vs. ‘LoanOriginalAmount’ calculates a
correlation coefficient of 0.341 which was one of the top three strongest
correlations discovered thus far.
The strongest relationship btween two variables I found thus far was between
the ‘OverdueLast7Years’ vs. the ‘CurrentDelinquencies’ columns. the correlation
cooeficiennt was calculated to be 0.378.
Renamed (copied bi_data set) to multi_data for further multivariate analysis
Eleven multivariate plots considering different relationships between three or
more variables will follow. Each of these variables were analyzed above to
determine what their relationship was with one other variable (a bivariate
analysis). I’ll be using colors, sizes and shapes to show how these additional
varaibles related with each of the first two variables. In the following plots
I’ll be looking at relationships between: CreditScoreMean, CurrentDelinquencies,
HomeMortgage, IncomeVerfiable, Term, LoanCode, DebtToIncomeRatio, BorrowerAPR,
IncomeRange_ordered, InvestorsFriendCount, FriendsAmountInvested & Occupation.
***
Multivariate_Plot_a, compares ‘CurrentDelinquencies’ along the x axis with
CreditScoreMean’ along the y axis. The value of the HomeMortgage which can
either be true or false is coded in color. In this plot I am looking to see
if having a HomeMortgage is related to CurrentDelinquencies and CreditScoreMean.
Becuase all of the red points fall mainly lower and farther to the right
borrowers with home mortgages tend to have better CreditScoreMeans and lower
variance in the number of loan payment delinquencies.
Multivariate_Plot_b, compares CreditScoreMean’ along the x axis with
‘CurrentDelinquencies’ along the y axis. In this plot however, we want to know if
having a verified income with the lender is related to delinquent loan payments
and or the borrower’s credit score mean. The low percentage of red points
compared to green points indicates very few loans are made to borrowers without
a verfied income. For those loans that have been made to borrowers without a
verified income the data suggests that few loans made to this group have the
lowest level of payment delinquencies. Very few of the red points lie along the
x axis compared with the green points.
Multivariate_Plot_c, compares’Term’ along the x axis with
‘CurrentDelinquencies’ along the y axis. It looks as though from this plot, that nearly all loans
with a 60 month term are going to borrowers with a home mortgage and the
delinquency rate among those loans is relatively low. On the other hand, loans
with a 12 month term appear to be almost evenly disributed among the borrowers
who have a home mortgage and borrowers who don’t have a home mortgage.
However, the most prevalent loan term (the 36 month term) data suggests
that payment delinquencies by non mortgage holders is far greater than the
number of payment delinquencies by the borrowers who have a home mortgage.
Moreover the frequency of current delinquencies (red points) is more prevalent
in the 36 month term category as can be seen in the higher relatvie number of
red points to green points.
Multivariate_Plot_d, compares ‘CurrentDelinquencies’ along the x axis with
‘LoanCode’ along the y axis. I’ve added ‘IncomeRange_ordered’ using color
visualizing the relationship of ‘IncomeRange_ordered’ with loan purpose listed as
‘LoanCode’ and current payment delinquencies listed as ‘CurrentDelinquencies’.
The conlusion I draw from this plot is that current payment delinquencies are
fairly evenly distributed across each of the income groupings. Other than that,
there are a few categories of loans that have fewer payment delinquencies that
most of the others, such as: “Green Loans”, “Boat Loans”, however this is most
likely due to fewer numbers of these loans being made.
Multivariate_Plot_e, compares’DebtToIncomeRatio’ along the x axis with
‘CurrentDelinquencies’ along the y axis. Using ‘IncomeRange_ordered’ with color
to stratify the current payment delinquencies by income group. An initial
observation is this lender is reluctant to approve loans above a DTI of greater
than roughly 35%, although there are some loans made over that level. It
appears that the number of current delinquencies is fairly evenly distributed
among the various income range groupings.
Multivariate_Plot_f, compares ‘BorrowerAPR’ along the x axis with
‘CurrentDelinquencies’ along the y axis. The plot indicates that the distribution of
current delinquencies is normally distributed among the 36 month loan term
shown in greed. It is less obvious what the distribution is for the 12, and 60
month loan terms.
Multivariate_Plot_g, compares ‘BorrowerAPR’ along the x axis with
‘DebtToIncomeRatio’ along the y axis. From this plot we can confirm that the
mean ‘BorrowerARR’ is roughly .22 and that the mean DTI is roughly .28. We can
also see that the distribution of each varaible potted against the other is
normally shaped. Finally, the distribution with regard to the ‘Term’ appears to
be nearly identical for each of the three term lengths.
Multivariate_Plot_h, compares ‘CurrentDelinquencies’ along the x axis with
‘IncomeRange_ordered’ along the y axis using color for the ‘Term’ variable.
This plot shows that the number of current delinquencies is fairly evenly
distributed between the upper four income ranges. Taking a second look at the
summary for the ‘IncomeRange_ordered’ varaible we see that only 621 loans were
made to borrowers in the lowest income category and that only 7274 loans out of
more than 110,000 loans were made to borrowers in the $1-24,999 income range
which is about 0.06 percent of total loans.
Multivariate_Plot_i, compares ‘CurrentDelinquencies’ along the x axis with
‘FriendsAmountInvested’ along the y axis using color with the’CreditScoreMean’
variable. As’FreindsAmountInvested. increases,‘CurrentDelinquencies’ decrease.
This is perhaps one of the fundamental pricipals in how peer funding works. By
having peers involved in the loan process the on time re payments increase.
Multivariate_Plot_j, compares ‘CreditScoreMean’ along the x axis with
‘FriendsAmountInvested’ along the y axis using color for the ‘FreindsAmountInvested’
variable. The frequency in ‘FreindsAmountInvested’ values appear to show no
difference in whether or not the borrower has a home mortgage or not.
Multivariate_Plot_k, compares ‘CurrentDelinquencies’ along the x axis with
‘Occupation’ along the y axis using color for the ‘HomeMortgage’ variable.
From this plot we can visually see that the borrowers without a home mortgage
appear to have a greater frequency of current delinquencies compared to the
borrowers who have a home mortgage. This is evidenced by the green points
coalescing on the left along the y axis and the red points gravitating toward
the right side along the y axis.
Linear Models
##
## Calls:
## f1: lm(formula = CurrentDelinquencies ~ CreditScoreMean, data = multi_data_samp_f)
## f2: lm(formula = CurrentDelinquencies ~ CreditScoreMean + HomeMortgage,
## data = multi_data_samp_f)
## f3: lm(formula = CurrentDelinquencies ~ CreditScoreMean + HomeMortgage +
## DebtToIncomeRatio, data = multi_data_samp_f)
## f4: lm(formula = CurrentDelinquencies ~ CreditScoreMean + HomeMortgage +
## DebtToIncomeRatio + BorrowerAPR, data = multi_data_samp_f)
## f5: lm(formula = CurrentDelinquencies ~ CreditScoreMean + HomeMortgage +
## DebtToIncomeRatio + BorrowerAPR + EmploymentStatus, data = multi_data_samp_f)
## f6: lm(formula = CurrentDelinquencies ~ CreditScoreMean + HomeMortgage +
## DebtToIncomeRatio + BorrowerAPR + EmploymentStatus + EmploymentDuration,
## data = multi_data_samp_f)
## f7: lm(formula = CurrentDelinquencies ~ CreditScoreMean + HomeMortgage +
## DebtToIncomeRatio + BorrowerAPR + EmploymentStatus + EmploymentDuration +
## IncomeRange, data = multi_data_samp_f)
## f8: lm(formula = CurrentDelinquencies ~ CreditScoreMean + HomeMortgage +
## DebtToIncomeRatio + BorrowerAPR + EmploymentStatus + EmploymentDuration +
## IncomeRange + Term, data = multi_data_samp_f)
## f9: lm(formula = CurrentDelinquencies ~ CreditScoreMean + HomeMortgage +
## DebtToIncomeRatio + BorrowerAPR + EmploymentStatus + EmploymentDuration +
## IncomeRange + Term + BorrowerState, data = multi_data_samp_f)
## f10: lm(formula = CurrentDelinquencies ~ CreditScoreMean + HomeMortgage +
## DebtToIncomeRatio + BorrowerAPR + EmploymentStatus + EmploymentDuration +
## IncomeRange + Term + BorrowerState + Occupation, data = multi_data_samp_f)
## f11: lm(formula = CurrentDelinquencies ~ CreditScoreMean + HomeMortgage +
## DebtToIncomeRatio + BorrowerAPR + EmploymentStatus + EmploymentDuration +
## IncomeRange + Term + BorrowerState + Occupation + LoanOriginationDate,
## data = multi_data_samp_f)
## f12: lm(formula = CurrentDelinquencies ~ CreditScoreMean + HomeMortgage +
## DebtToIncomeRatio + BorrowerAPR + EmploymentStatus + EmploymentDuration +
## IncomeRange + Term + BorrowerState + Occupation + LoanOriginationDate +
## IncomeVerifiable, data = multi_data_samp_f)
##
## ==========================================================================================================================================================================================
## f1 f2 f3 f4 f5 f6 f7 f8 f9 f10 f11 f12
## ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
## (Intercept) 6.852*** 6.983*** 7.934*** 6.224** 3.023 2.963 3.750 4.141 5.104 0.500 3.905 3.905
## (1.503) (1.565) (1.597) (2.041) (2.335) (2.343) (2.482) (2.539) (2.886) (4.270)
## CreditScoreMean -0.009*** -0.009*** -0.011*** -0.009** -0.006 -0.006 -0.006 -0.006 -0.006 -0.003 -0.002 -0.002
## (0.002) (0.002) (0.002) (0.003) (0.003) (0.003) (0.003) (0.003) (0.003) (0.005)
## HomeMortgage: True/False 0.087 0.316 0.281 0.360 0.323 0.481 0.476 0.424 0.406 1.111 1.111
## (0.281) (0.290) (0.290) (0.285) (0.293) (0.302) (0.302) (0.367) (0.501)
## DebtToIncomeRatio -0.152 -0.145 -0.169 -0.164 -0.250 -0.241 -0.145 -0.238 -2.379 -2.379
## (0.159) (0.159) (0.156) (0.157) (0.166) (0.167) (0.231) (0.297)
## BorrowerAPR 2.512 5.295* 5.360* 4.784* 4.849* 3.349 4.920 3.930 3.930
## (1.875) (2.081) (2.090) (2.143) (2.148) (2.596) (3.603)
## EmploymentStatus: Full-time/Employed 0.695* 0.728* 0.675* 0.585 0.302 0.612 -1.243 -1.243
## (0.326) (0.332) (0.338) (0.359) (0.427) (0.555)
## EmploymentStatus: Other/Employed 1.029 1.028 0.809 0.786 0.636 1.136 1.048 1.048
## (0.597) (0.599) (0.631) (0.633) (0.709) (1.001)
## EmploymentStatus: Part-time/Employed -0.765 -0.717 -1.436 -1.475 -0.878 -0.409 -4.974 -4.974
## (1.099) (1.105) (1.207) (1.210) (1.392) (1.836)
## EmploymentStatus: Retired/Employed 3.444* 3.516* 3.320* 3.246 2.558 2.979 4.542 4.542
## (1.631) (1.640) (1.643) (1.649) (1.819) (2.132)
## EmploymentStatus: Self-employed/Employed 1.472 1.540 1.341 1.265 1.444 1.946 1.131 1.131
## (0.786) (0.798) (0.817) (0.824) (0.915) (1.187)
## EmploymentDuration 0.001 0.001 0.001 0.002 0.002 -0.007 -0.007
## (0.001) (0.001) (0.001) (0.002) (0.002)
## IncomeRange: $100,000+/$1-24,999 -0.927 -0.904 -0.675 -1.012 1.455 1.455
## (0.672) (0.674) (0.777) (1.075)
## IncomeRange: $25,000-49,999/$1-24,999 -0.547 -0.497 -0.270 -0.131 -2.575 -2.575
## (0.615) (0.619) (0.692) (0.874)
## IncomeRange: $50,000-74,999/$1-24,999 -0.957 -0.885 -0.593 -1.021 0.554 0.554
## (0.601) (0.609) (0.684) (0.907)
## IncomeRange: $75,000-99,999/$1-24,999 -1.249 -1.148 -0.698 -1.155 -4.421 -4.421
## (0.670) (0.684) (0.809) (1.172)
## Term -0.010 -0.018 0.004 -0.010 -0.010
## (0.013) (0.015) (0.023)
## BorrowerState: AZ 3.299* 4.097* 9.564 9.564
## (1.335) (1.639)
## BorrowerState: CA -0.429 -0.556 -2.044 -2.044
## (0.977) (1.250)
## BorrowerState: CT -0.263 -0.223 -2.282 -2.282
## (1.392) (1.608)
## BorrowerState: DE -1.384 -1.756 5.635 5.635
## (1.750) (2.347)
## BorrowerState: FL -1.019 -1.380 2.050 2.050
## (0.961) (1.167)
## BorrowerState: GA 0.538 0.690 0.878 0.878
## (0.928) (1.173)
## BorrowerState: IA -0.566 0.748 -1.444 -1.444
## (1.726) (2.856)
## BorrowerState: ID -1.564 -0.568 2.153 2.153
## (1.756) (2.385)
## BorrowerState: IL -1.248 -1.459 2.340 2.340
## (0.917) (1.157)
## BorrowerState: IN -0.851 -0.908 2.975 2.975
## (1.096) (1.306)
## BorrowerState: KS -0.146 -0.304 -2.384 -2.384
## (1.200) (1.445)
## BorrowerState: MA -0.248 0.300 5.339 5.339
## (1.294) (1.604)
## BorrowerState: MD -0.488 -0.082 2.746 2.746
## (1.339) (1.740)
## BorrowerState: MI -0.918 -0.874 -1.265 -1.265
## (1.036) (1.238)
## BorrowerState: MN -0.488 -1.176 -2.159 -2.159
## (1.036) (1.335)
## BorrowerState: MO -0.762 -0.052 -2.999 -2.999
## (1.065) (1.286)
## BorrowerState: MS -0.892 -1.501 2.432 2.432
## (1.069) (1.247)
## BorrowerState: NC 1.076 1.602 -2.228 -2.228
## (1.050) (1.362)
## BorrowerState: NH -0.938 -2.577 -0.402 -0.402
## (1.408) (1.837)
## BorrowerState: NJ -0.320 -0.282 -2.195 -2.195
## (1.094) (1.369)
## BorrowerState: NV 0.878 0.107 -2.546 -2.546
## (1.462) (1.681)
## BorrowerState: NY -0.962 -1.075 -0.456 -0.456
## (0.964) (1.212)
## BorrowerState: OH -0.905 -0.947 -1.945 -1.945
## (1.117) (1.475)
## BorrowerState: OK -1.031 -0.561 -2.608 -2.608
## (1.774) (2.588)
## BorrowerState: OR -0.988 -0.531 -2.426 -2.426
## (1.769) (2.077)
## BorrowerState: PA -0.272 -0.126 -1.379 -1.379
## (1.091) (1.417)
## BorrowerState: RI -0.543 -0.675 -1.720 -1.720
## (1.526) (1.910)
## BorrowerState: SC -1.560 -1.778 0.332 0.332
## (1.195) (1.428)
## BorrowerState: TN -1.553 -1.956 3.838 3.838
## (1.382) (1.619)
## BorrowerState: TX -1.019 -0.312 -4.225 -4.225
## (1.133) (1.423)
## BorrowerState: UT -0.107 0.116 -3.605 -3.605
## (1.835) (2.178)
## BorrowerState: VA -0.946 -0.687 -2.196 -2.196
## (1.144) (1.498)
## BorrowerState: WA -0.878 -0.153 -2.513 -2.513
## (1.759) (2.459)
## BorrowerState: WI -0.999 -1.115 -1.289 -1.289
## (1.035) (1.234)
## BorrowerState: WV -1.098 -0.766 -8.660 -8.660
## (1.743) (2.175)
## Occupation: Accountant/CPA 2.202 -5.351 -5.351
## (2.029)
## Occupation: Administrative Assistant 2.658 -2.720 -2.720
## (1.764)
## Occupation: Analyst 2.136 -4.407 -4.407
## (1.988)
## Occupation: Architect 1.576 -3.253 -3.253
## (2.520)
## Occupation: Attorney 1.321 -2.576 -2.576
## (2.134)
## Occupation: Car Dealer 2.637 -1.139 -1.139
## (2.545)
## Occupation: Chemist 1.767 -0.513 -0.513
## (2.590)
## Occupation: Clerical 3.214 -2.025 -2.025
## (1.816)
## Occupation: Computer Programmer 1.199 2.066 2.066
## (2.023)
## Occupation: Construction -1.154 3.490 3.490
## (2.046)
## Occupation: Doctor 2.624 -2.144 -2.144
## (2.495)
## Occupation: Engineer - Chemical 0.578 -0.933 -0.933
## (2.554)
## Occupation: Engineer - Electrical 3.392 -3.757 -3.757
## (2.787)
## Occupation: Executive 1.339 -0.495 -0.495
## (1.821)
## Occupation: Food Service 1.423 -0.754 -0.754
## (2.198)
## Occupation: Food Service Management 2.576 0.378 0.378
## (2.029)
## Occupation: Laborer 0.258 1.444 1.444
## (2.159)
## Occupation: Military Enlisted 1.666 0.182 0.182
## (2.574)
## Occupation: Nurse (LPN) -0.788 -0.082 -0.082
## (2.397)
## Occupation: Nurse (RN) 1.894 4.383 4.383
## (1.921)
## Occupation: Nurse's Aide 0.148 2.865 2.865
## (2.140)
## Occupation: Other 1.716 -0.231 -0.231
## (1.550)
## Occupation: Police Officer/Correction Officer 2.253 3.590 3.590
## (2.076)
## Occupation: Professional 2.583 -1.119 -1.119
## (1.731)
## Occupation: Professor 1.525 0.427 0.427
## (2.247)
## Occupation: Retail Management 1.103 -2.043 -2.043
## (2.590)
## Occupation: Sales - Commission 1.449 -2.057 -2.057
## (2.071)
## Occupation: Sales - Retail 1.582 -1.181 -1.181
## (1.887)
## Occupation: Scientist 2.109 0.146 0.146
## (2.576)
## Occupation: Skilled Labor 1.573 -1.937 -1.937
## (1.853)
## Occupation: Social Worker 1.237 -1.528 -1.528
## (2.543)
## Occupation: Teacher 2.255 0.635 0.635
## (1.834)
## Occupation: Tradesman - Mechanic 1.966 -0.759 -0.759
## (2.431)
## Occupation: Truck Driver 2.081 -2.861 -2.861
## (2.437)
## LoanOriginationDate2007-03-07 00 x 00 x 00 4.828 4.828
##
## LoanOriginationDate2007-03-09 00 x 00 x 00 3.327 3.327
##
## LoanOriginationDate2007-03-27 00 x 00 x 00 0.881 0.881
##
## LoanOriginationDate2007-04-24 00 x 00 x 00 4.413 4.413
##
## LoanOriginationDate2007-05-08 00 x 00 x 00 5.385 5.385
##
## LoanOriginationDate2007-05-15 00 x 00 x 00 4.514 4.514
##
## LoanOriginationDate2007-05-31 00 x 00 x 00 23.737 23.737
##
## LoanOriginationDate2007-07-06 00 x 00 x 00 -0.536 -0.536
##
## LoanOriginationDate2007-07-11 00 x 00 x 00 -4.505 -4.505
##
## LoanOriginationDate2007-07-17 00 x 00 x 00 5.573 5.573
##
## LoanOriginationDate2007-08-16 00 x 00 x 00 0.943 0.943
##
## LoanOriginationDate2007-12-18 00 x 00 x 00 1.747 1.747
##
## LoanOriginationDate2008-01-15 00 x 00 x 00 3.720 3.720
##
## LoanOriginationDate2008-01-24 00 x 00 x 00 8.198 8.198
##
## LoanOriginationDate2008-03-04 00 x 00 x 00 0.612 0.612
##
## LoanOriginationDate2008-03-20 00 x 00 x 00 -7.020 -7.020
##
## LoanOriginationDate2008-03-28 00 x 00 x 00 3.309 3.309
##
## LoanOriginationDate2008-05-27 00 x 00 x 00 0.991 0.991
##
## LoanOriginationDate2008-06-03 00 x 00 x 00 5.473 5.473
##
## LoanOriginationDate2008-06-10 00 x 00 x 00 -2.577 -2.577
##
## LoanOriginationDate2008-07-03 00 x 00 x 00 5.252 5.252
##
## LoanOriginationDate2008-07-22 00 x 00 x 00 -1.558 -1.558
##
## LoanOriginationDate2008-08-11 00 x 00 x 00 6.071 6.071
##
## LoanOriginationDate2009-08-13 00 x 00 x 00 -2.851 -2.851
##
## LoanOriginationDate2009-11-06 00 x 00 x 00 0.588 0.588
##
## LoanOriginationDate2010-02-17 00 x 00 x 00 -1.660 -1.660
##
## LoanOriginationDate2010-02-26 00 x 00 x 00 0.436 0.436
##
## LoanOriginationDate2010-08-20 00 x 00 x 00 5.345 5.345
##
## LoanOriginationDate2010-09-28 00 x 00 x 00 5.413 5.413
##
## LoanOriginationDate2010-11-24 00 x 00 x 00 5.528 5.528
##
## LoanOriginationDate2011-01-03 00 x 00 x 00 -5.208 -5.208
##
## LoanOriginationDate2011-03-02 00 x 00 x 00 -8.147 -8.147
##
## LoanOriginationDate2011-05-27 00 x 00 x 00 0.424 0.424
##
## LoanOriginationDate2011-06-08 00 x 00 x 00 -2.675 -2.675
##
## LoanOriginationDate2011-06-30 00 x 00 x 00 -1.282 -1.282
##
## LoanOriginationDate2011-08-04 00 x 00 x 00 -0.378 -0.378
##
## LoanOriginationDate2011-08-31 00 x 00 x 00 10.966 10.966
##
## LoanOriginationDate2011-10-12 00 x 00 x 00 -1.318 -1.318
##
## LoanOriginationDate2011-12-29 00 x 00 x 00 -0.071 -0.071
##
## LoanOriginationDate2012-05-16 00 x 00 x 00 4.102 4.102
##
## LoanOriginationDate2012-06-27 00 x 00 x 00 0.717 0.717
##
## LoanOriginationDate2012-07-11 00 x 00 x 00 -1.872 -1.872
##
## LoanOriginationDate2012-07-24 00 x 00 x 00 1.406 1.406
##
## LoanOriginationDate2012-08-07 00 x 00 x 00 -5.483 -5.483
##
## LoanOriginationDate2012-08-08 00 x 00 x 00 1.624 1.624
##
## LoanOriginationDate2012-08-28 00 x 00 x 00 -17.178 -17.178
##
## LoanOriginationDate2012-09-12 00 x 00 x 00 -1.884 -1.884
##
## LoanOriginationDate2012-09-19 00 x 00 x 00 0.472 0.472
##
## LoanOriginationDate2012-09-24 00 x 00 x 00 -4.669 -4.669
##
## LoanOriginationDate2012-09-27 00 x 00 x 00 4.861 4.861
##
## LoanOriginationDate2012-11-30 00 x 00 x 00 5.741 5.741
##
## LoanOriginationDate2012-12-07 00 x 00 x 00 10.430 10.430
##
## LoanOriginationDate2013-01-25 00 x 00 x 00 -0.055 -0.055
##
## LoanOriginationDate2013-02-25 00 x 00 x 00 6.763 6.763
##
## LoanOriginationDate2013-05-30 00 x 00 x 00 -10.563 -10.563
##
## LoanOriginationDate2013-07-09 00 x 00 x 00 -4.739 -4.739
##
## LoanOriginationDate2013-07-10 00 x 00 x 00 8.620 8.620
##
## ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
## R-squared 0.102 0.102 0.133 0.144 0.217 0.219 0.248 0.251 0.456 0.601 1.000 1.000
## adj. R-squared 0.096 0.090 0.114 0.119 0.164 0.159 0.165 0.162 0.158 0.013
## sigma 1.578 1.582 1.556 1.552 1.512 1.516 1.511 1.514 1.517 1.643
## F 17.102 8.548 7.042 5.761 4.066 3.672 2.985 2.815 1.528 1.022
## p 0.000 0.000 0.000 0.000 0.000 0.000 0.001 0.001 0.040 0.471
## Log-likelihood -285.835 -285.787 -262.274 -261.350 -255.014 -254.844 -252.188 -251.865 -229.104 -207.175 Inf Inf
## Deviance 375.767 375.528 334.268 329.948 301.779 301.057 290.005 288.687 209.508 153.839 0.000 0.000
## AIC 577.670 579.573 534.547 534.700 532.028 533.687 536.377 537.730 562.208 586.350 -Inf -Inf
## BIC 586.762 591.695 549.326 552.435 564.542 569.157 583.670 587.979 715.911 840.551 -Inf -Inf
## N 153 153 142 142 142 142 142 142 142 142 142 142
## ==========================================================================================================================================================================================
## Warning in rm(p): object 'p' not found
Through this analysis I’ve discovered that the ‘CreditScoreMean’ and ‘CurrentDelinquencies’ are inversely related.
Through the analysis I’ve also dsicovered that on average borrowers who have a ‘HomeMortgage’ have fewer “CurrentDelinquencies’.
Through this analysis I also discivered that the higher the Borrower’s ‘DebtToIncomeRatio’ value, the higher the “BorrowerAPR’ / annual effective interest rate.
The linear model I created attained an R-squared result of 1.00 which I think is about maximum possible. So perhaps this is a good model.
Final Plot 1 shows the relationship between the dependent or response variable
‘CurrentDelinquencies’ and the independent variable ’CreditScoreMean which was
divided into six levels with the cut function. A regresssion line was plotted
through the pints to show that the two variables are inversely related
Final Plot 2 shows the relationship between the dependent variable
‘CurrentDelinquencies’ which was plotted on the x axis instead of the y axis
for ease of labeling of the tick labels. Color was added to this plot to show
the interrelationship of the ‘CreditScoreMean’ variable. This plot clearly
shows that the lower credit scores have the greatest number of payment
delinquencies.
Final Plot three shows the relationship between ‘LoanCode’ which are the loan
categories on the y axis and ‘CurrentDelinquencies’ on the x axis. The
‘CreditScoreMean’ variable was cut into seven groupings and used interacively
in this plot. This plot demonstrates the majore significance of the
‘CreditScoreMean’ or most credit rating systems in general. Here we can see that
very few delinquencies show up for the lowest credit scores irregardless of
laon purpose. This is most likey because lenders are not lending that much to
borrowers with very low credit scores. On the other hand we can also see that
the borrowers with the highest credit scores tend to have the fewest number
of delinquent payments irregardles of the purpose of the loans. We can also
see that certain categories of loans are not as prevalent like boat loans and
RV loans for example. example. variabe
Conclusions:
This was a challenging project in a number of ways. First of all, the data set was
rather large in both the number of rows and columns. I found that one of the greatest
challenges for me was to get the coding right for making the plots. Once I could
accomplish that task I began to see the shortcomings in the data and had to start
making adjustments and modifications to get the plots to work. I think learning to
utilize the ggplot2 package and a number of other packages made the task go much
quicker than it otherwise would have without them. I thought it was very interesting
how the R-squared value of the model I created attained a 1.00 result which indicates
that the model predicts very well. If I did future work in this project I would
experiment with other types of graphs.
References:
German Rodriguez Session Info
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.3
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] dplyr_0.8.0.1 plyr_1.8.4 stringr_1.4.0 ggpubr_0.2 magrittr_1.5
## [6] ggplot2_3.1.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.0 knitr_1.21 tidyselect_0.2.5 munsell_0.5.0
## [5] colorspace_1.4-0 R6_2.4.0 rlang_0.3.1 tools_3.5.2
## [9] grid_3.5.2 gtable_0.2.0 xfun_0.5 withr_2.1.2
## [13] htmltools_0.3.6 assertthat_0.2.0 yaml_2.2.0 lazyeval_0.2.1
## [17] digest_0.6.18 tibble_2.0.1 crayon_1.3.4 purrr_0.3.1
## [21] glue_1.3.0 evaluate_0.13 rmarkdown_1.11 labeling_0.3
## [25] stringi_1.3.1 compiler_3.5.2 pillar_1.3.1 scales_1.0.0
## [29] pkgconfig_2.0.2
## [1] "Wed Mar 6 23:19:47 2019"